Retrieval Augmented Structured Generation: Business Document Information Extraction As Tool Use
Cesista, Franz Louis, Aguiar, Rui, Kim, Jason, Acilo, Paolo
Business Document Information Extraction (BDIE) is the problem of transforming a blob of unstructured information (raw text, scanned documents, etc.) into a structured format that downstream systems can parse and use. It has two main tasks: Key-Information Extraction (KIE) and Line Items Recognition (LIR). In this paper, we argue that BDIE is best modeled as a Tool Use problem, where the tools are these downstream systems. We then present Retrieval Augmented Structured Generation (RASG), a novel general framework for BDIE that achieves state-of-the-art (SOTA) results on both KIE and LIR tasks on BDIE benchmarks. The contributions of this paper are threefold: (1) we show, with ablation benchmarks, that Large Language Models (LLMs) with RASG are already competitive with or surpass current SOTA Large Multimodal Models (LMMs) without RASG on BDIE benchmarks; (2) we propose a new metric class for Line Items Recognition, the General Line Items Recognition Metric (GLIRM), which is better aligned with practical BDIE use cases than existing metrics such as ANLS*, DocILE, and GriTS; (3) we provide a heuristic algorithm for back-calculating the bounding boxes of predicted line items and tables without the need for vision encoders. Finally, we argue that, while LMMs might sometimes offer marginal performance benefits, LLMs + RASG are often superior given the real-world applications and constraints of BDIE.
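The "extraction as tool use" framing can be sketched as follows: a downstream system is exposed as a tool whose required input fields define the extraction schema, and the LLM's structured output is validated against that schema before the tool is invoked. This is a minimal illustrative sketch, not the paper's actual implementation; the schema, field names, and helper function are hypothetical.

```python
import json

# Hypothetical "tool" schema: the input contract of a downstream system
# (e.g. an invoice-registration service) doubles as the KIE schema.
INVOICE_TOOL_SCHEMA = {
    "name": "register_invoice",
    "required": ["invoice_number", "total_amount", "currency"],
}

def validate_tool_call(raw_output: str, schema: dict) -> dict:
    """Parse an LLM's structured output and check it satisfies the tool
    schema. In a full RASG-style loop, a failure here could trigger
    retrieval of similar documents and a constrained re-generation."""
    payload = json.loads(raw_output)
    missing = [f for f in schema["required"] if f not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return payload

# Illustrative LLM output (synthetic example):
llm_output = '{"invoice_number": "INV-001", "total_amount": 99.5, "currency": "USD"}'
print(validate_tool_call(llm_output, INVOICE_TOOL_SCHEMA))
```

Treating the downstream system's contract as the single source of truth for the schema is what makes the extraction output directly consumable, rather than a free-form summary that still needs parsing.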
Operationalising Representation in Natural Language Processing
Despite its centrality in the philosophy of cognitive science, there has been little prior philosophical work engaging with the notion of representation in contemporary NLP practice. This paper attempts to fill that lacuna: drawing on ideas from cognitive science, I introduce a framework for evaluating the representational claims made about components of neural NLP models, proposing three criteria with which to evaluate whether a component of a model represents a property and operationalising these criteria using probing classifiers, a popular analysis technique in NLP (and deep learning more broadly). The project of operationalising a philosophically-informed notion of representation should be of interest to both philosophers of science and NLP practitioners. It affords philosophers a novel testing-ground for claims about the nature of representation, and helps NLPers organise the large literature on probing experiments, suggesting novel avenues for empirical research.
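The probing methodology the paper operationalises can be sketched in a few lines: train a simple supervised classifier to predict a property from a model component's representations, and treat high held-out accuracy as (defeasible) evidence that the property is encoded. Everything below, including the synthetic "representations," is an illustrative toy and not from the paper.

```python
import math
import random

random.seed(0)

# Toy stand-in for frozen model representations: 4-d vectors in which
# dimension 0 carries a binary property, the rest is noise (all synthetic).
def make_example():
    label = random.randint(0, 1)
    vec = [label * 2.0 - 1.0 + random.gauss(0, 0.3)]
    vec += [random.gauss(0, 1) for _ in range(3)]
    return vec, label

train = [make_example() for _ in range(200)]
held_out = [make_example() for _ in range(100)]

# Linear probe: logistic regression trained by plain gradient descent.
w, b, lr = [0.0] * 4, 0.0, 0.1
for epoch in range(50):
    for x, y in train:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        p = 1.0 / (1.0 + math.exp(-z))
        g = p - y  # gradient of the log-loss w.r.t. z
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

def probe_accuracy(data):
    correct = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(data)

print(probe_accuracy(held_out))
```

High probe accuracy alone is exactly the kind of evidence the paper's three criteria are meant to discipline: a property being decodable from a component is necessary but not obviously sufficient for the component to represent it.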
Optimizing AD Pruning of Sponsored Search with Reinforcement Learning
Lian, Yijiang, Chen, Zhijie, Pei, Xin, Li, Shuang, Wang, Yifei, Qiu, Yuefeng, Zhang, Zhiheng, Tao, Zhipeng, Yuan, Liang, Guan, Hanju, Zhang, Kefeng, Li, Zhigang, Liu, Xiaochun
An industrial sponsored search system (SSS) can be logically divided into three modules: keyword matching, ad retrieval, and ranking. During ad retrieval, the number of ad candidates grows exponentially; a query with high commercial value might retrieve more candidates than the ranking module can afford to process. Due to limited latency and computing resources, the candidates have to be pruned early. Suppose we draw a pruning line that cuts the SSS into two parts: upstream and downstream. The problem we address is how to pick the best $K$ items from the $N$ candidates provided by the upstream so as to maximize the total system's revenue. Since the industrial downstream is very complicated and updated quickly, a crucial restriction is that the selection scheme must adapt to the downstream. In this paper, we propose a novel model-free reinforcement learning approach to this problem. Our approach treats the downstream as a black-box environment: the agent sequentially selects items and finally feeds them into the downstream, where revenue is estimated and used as a reward to improve the selection policy. To the best of our knowledge, this is the first time the system optimization has been considered from a downstream-adaptation view, and the first time reinforcement learning techniques have been used to tackle this problem. The idea has been successfully realized in Baidu's sponsored search system, and long-running online A/B tests show remarkable improvements in revenue.
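The black-box setup above can be sketched in a few dozen lines. This is a toy, self-contained illustration under my own assumptions: the "downstream" is a stand-in function the agent cannot inspect, the policy is a simple softmax over per-item logits, and the update is a crude score-function (REINFORCE-style) surrogate with a running baseline. None of this is Baidu's actual model.

```python
import math
import random

random.seed(0)

N, K = 10, 3  # pick the best K of N candidates provided by the upstream

def downstream_revenue(selected):
    """Black-box downstream: the agent never sees these hidden values,
    only the scalar revenue returned for a complete selection."""
    hidden_value = [0.1 * i for i in range(N)]
    return sum(hidden_value[i] for i in selected)

theta = [0.0] * N  # one logit per item (a deliberately simple policy)

def sample_selection():
    """Sample K distinct items from a softmax over the logits."""
    items, chosen = list(range(N)), []
    for _ in range(K):
        weights = [math.exp(theta[i]) for i in items]
        r = random.random() * sum(weights)
        acc = 0.0
        for idx, i in enumerate(items):
            acc += weights[idx]
            if acc >= r:
                chosen.append(items.pop(idx))
                break
        else:  # guard against floating-point rounding
            chosen.append(items.pop())
    return chosen

baseline = 0.0  # running reward baseline to reduce gradient variance
for step in range(2000):
    sel = sample_selection()
    reward = downstream_revenue(sel)  # the only feedback available
    baseline += 0.01 * (reward - baseline)
    advantage = reward - baseline
    for i in range(N):
        # Crude score-function surrogate: raise logits of selected items
        # when the advantage is positive, lower them otherwise.
        indicator = 1.0 if i in sel else 0.0
        theta[i] += 0.05 * advantage * (indicator - K / N)

best = sorted(range(N), key=lambda i: theta[i], reverse=True)[:K]
print(best)
```

The point of the sketch is the interface, not the algorithmic details: because the policy only ever consumes a scalar reward from the downstream, the selection scheme keeps adapting even as the downstream changes underneath it.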
Change Data Capture (CDC) and Kafka
Change Data Capture (CDC) is an approach to data integration based on the identification, capture, and delivery of changes made to data sources, typically relational databases. A change operation can be the INSERT of a new record, or an UPDATE or DELETE of an existing one. With Apache Kafka, and in particular with the Kafka Connect API and the available source connectors, it's very easy to create a data pipeline that captures changes from an existing RDBMS and delivers them to a Kafka cluster. From there you can send those changes to downstream systems, typically NoSQL storage systems (such as Cassandra, MongoDB, or Couchbase) or search engines (such as Elasticsearch). It is also possible, and advisable, to keep the changes stored or cached in a Kafka compacted topic: that way, if you want to perform joins via Kafka Streams or KSQL, they can be done easily and efficiently in parallel, with no repartitioning necessary.
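As an example of how little configuration such a pipeline needs, a CDC source connector is typically just a small JSON payload posted to the Kafka Connect REST API. The sketch below uses Debezium's MySQL connector; the hostnames, credentials, and database names are placeholders, and the property names follow Debezium 2.x conventions, so check them against the connector version you actually deploy.

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "dbserver1",
    "database.include.list": "inventory",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
```

Once registered, the connector streams each table's INSERT/UPDATE/DELETE events to its own topic (here prefixed `dbserver1.`), ready for downstream sinks or for compaction and joining.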